Introduction:

The aim of this project is to predict a given individual’s yearly medical insurance bill in America. I obtained my dataset from Kaggle. The data contains 6 predictors: age, sex, bmi, number of children, smoker status, and region. I will fit multiple models to this regression problem in order to predict an individual’s yearly medical insurance cost, as well as to determine which factors most greatly affect it.

Background/Why is this model relevant?:

Medical insurance is a hotly disputed topic in America, as we charge much more for health care than other countries. Additionally, healthcare and hospital costs are rising, making this a prominent issue: people are in more urgent need of knowing their medical insurance costs, for example to see whether they can afford them or whether they qualify for government aid. The cost of certain healthcare products (such as drugs) depends on external factors, such as market forces. This creates uncertainty and variability in a person’s medical bill, because some people may be more susceptible to requiring these drugs while the cost of the drugs changes with the market. I am hoping to fit a model that best estimates how much a person will have to pay yearly in medical insurance. I believe this is relevant knowledge, as healthcare is especially costly in America, and people should be aware of how much they will have to pay in case they need to apply for government aid to cover their medical bills.

Loading Data and Packages

I obtained the dataset from Kaggle. The data contains 7 columns (6 predictor variables and 1 outcome variable) and 1338 rows. The following provides a description of the 7 variables:

1. charges: The outcome variable; the amount an individual has to pay yearly in medical insurance, in dollars.
2. age: The age of the primary beneficiary.
3. sex: The gender of the insurance contractor: male or female.
4. bmi: Body mass index of the primary beneficiary.
5. children: Number of children covered by the health insurance (also known as number of dependents).
6. smoker: Whether the person smokes: yes or no.
7. region: Beneficiary’s location of residence in the US: northeast, southeast, southwest, or northwest.
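A minimal loading step might look like the following sketch. The file name insurance.csv and the object name bill are assumptions (Kaggle’s version of this dataset commonly uses that file name); the tidyverse and tidymodels packages are used throughout the report.

```r
# Load the packages used throughout the analysis
library(tidyverse)
library(tidymodels)

# Read the Kaggle data; the file name is an assumption
bill <- read_csv("insurance.csv")
dim(bill)  # 1338 rows, 7 columns
```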

Data Cleaning and Splitting

## # A tibble: 1,338 × 7
##      age sex      bmi children smoker region    charges
##    <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
##  1    19 female  27.9        0 yes    southwest  16885.
##  2    18 male    33.8        1 no     southeast   1726.
##  3    28 male    33          3 no     southeast   4449.
##  4    33 male    22.7        0 no     northwest  21984.
##  5    32 male    28.9        0 no     northwest   3867.
##  6    31 female  25.7        0 no     southeast   3757.
##  7    46 female  33.4        1 no     southeast   8241.
##  8    37 female  27.7        3 no     northwest   7282.
##  9    37 male    29.8        2 no     northeast   6406.
## 10    60 female  25.8        0 no     northwest  28923.
## # … with 1,328 more rows
any(is.na(bill))
## [1] FALSE

The original data does not contain any missing/NA values.

## # A tibble: 1,070 × 7
##      age sex      bmi children smoker region    charges
##    <dbl> <fct>  <dbl>    <dbl> <fct>  <fct>       <dbl>
##  1    28 male    33          3 no     southeast   4449.
##  2    32 male    28.9        0 no     northwest   3867.
##  3    25 male    26.2        0 no     northeast   2721.
##  4    23 male    34.4        0 no     southwest   1827.
##  5    19 male    24.6        1 no     southwest   1837.
##  6    23 male    23.8        0 no     northeast   2395.
##  7    30 female  32.4        1 no     southwest   4150.
##  8    18 male    34.1        0 no     southeast   1137.
##  9    23 male    17.4        1 no     northwest   2775.
## 10    18 female  26.3        0 no     northeast   2198.
## # … with 1,060 more rows
## # A tibble: 268 × 7
##      age sex      bmi children smoker region    charges
##    <dbl> <fct>  <dbl>    <dbl> <fct>  <fct>       <dbl>
##  1    19 female  27.9        0 yes    southwest  16885.
##  2    18 male    33.8        1 no     southeast   1726.
##  3    31 female  25.7        0 no     southeast   3757.
##  4    37 female  27.7        3 no     northwest   7282.
##  5    37 male    29.8        2 no     northeast   6406.
##  6    27 male    42.1        0 yes    southeast  39612.
##  7    56 male    40.3        0 no     southwest  10602.
##  8    37 male    28.0        2 no     northwest   6204.
##  9    59 female  27.7        3 no     southeast  14001.
## 10    22 male    35.6        0 yes    southwest  35586.
## # … with 258 more rows

I converted all character variables to factors so that they are easier to use when modeling and creating visualizations. I split the data into a training and a testing set, stratifying on the outcome variable, charges, so that each set covers a well-proportioned range of charges. The training set contains 1,070 observations, while the testing set contains 268 observations. Since we only have 6 predictor variables, all of which seem potentially useful, we do not have to worry about eliminating unimportant variables from the dataset. Instead, let’s start exploratory data analysis on the training set.
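The cleaning and splitting described above can be sketched as follows; the seed value, the 80/20 proportion, and the object names are assumptions consistent with the row counts shown.

```r
# Convert character columns to factors for modeling and plotting
bill <- bill %>%
  mutate(across(where(is.character), as.factor))

# 80/20 split, stratified on the outcome so both sets
# cover the full range of charges
set.seed(123)  # seed value is an assumption
bill_split <- initial_split(bill, prop = 0.8, strata = charges)
bill_train <- training(bill_split)  # 1,070 rows
bill_test  <- testing(bill_split)   # 268 rows
```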

EDA

Let’s start off by creating a correlation matrix of the numeric variables to see which ones are most closely related to our outcome variable, charges, and examine those relationships further.

Correlation Matrix

Here we can see that age and BMI have a noticeable positive correlation with charges. I am surprised that the number of children a beneficiary has does not seem to affect insurance cost very much. This is surprising because I assumed each child would mean another insurance bill, but perhaps family plans exist so that families do not have to pay unreasonably expensive amounts.
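The matrix itself can be computed with base R’s cor() on the numeric columns; a package such as corrplot can then render the plot. The object name bill_train is an assumption.

```r
# Correlation between the numeric predictors and the outcome
num_vars <- bill_train %>%
  select(age, bmi, children, charges)
round(cor(num_vars), 2)
```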

Age

Now let’s look at the distribution of age, which I believe will be an important predictor variable. The bar plot indicates that we have many 17-19 year olds in our training set, while the counts for the remaining ages are relatively even. This is important to note because we do not want an imbalanced distribution: if our training set contained mostly 20 year olds (for example), our model would only learn to predict the insurance bills of 20 year olds and would predict costs for other ages less accurately. Let’s look at how age compares with insurance costs, as the correlation matrix indicated that age is the predictor most highly correlated with charges.

Based on this simple plot we can see that as age increases, insurance charges also increase in a positive linear trend. However, there seem to be three distinct bands, with around 5 outliers indicating high insurance charges. It makes sense that insurance cost rises with age, as people become more prone to illness as they grow old, resulting in higher insurance bills. It is interesting to note the three different bands: for example, a 20 year old’s insurance cost can fall in one of three categories of around $1,000, around $2,000, or around $4,000.

BMI

The correlation matrix also indicated that an individual’s BMI plays a big role in their insurance bill, as it reflects how healthy their body weight is and, in turn, how prone they are to health risks. We can see that the median BMI of the individuals in our training set is slightly over 30. A BMI of 30 or above is classified as obese [cite this]. This makes sense for our dataset, which was collected in America, a country known for high obesity rates. We have a few outliers toward the 45-50 BMI range, most likely indicating high medical insurance bills. This also aligns with the outliers we saw in our scatter plot, where certain individuals had significantly higher insurance bills than the rest of their age group.

Let’s see how insurance charges compare directly with BMI values. There seems to be no clear trend between BMI and insurance charges. Let’s see if a trend emerges when we split this scatter plot by the sex variable. The trend remains largely the same, with males showing a greater number of high insurance charges. This leads me to conclude that BMI does not have a significant effect on yearly insurance costs. Just to double check, let’s see whether a trend appears between insurance costs and BMI when split by region.

There seems to be no apparent trend here either. One particular observation is that the arrangement of points in the two southern regions is similar to each other, and the points in the two northern regions are likewise similar to each other. For example, the northeast and northwest regions both have a similar number of points in the 30-40 BMI range. This makes sense, as insurance charges are more likely to be similar in regions closer to each other.

Region

I believe that an individual’s location of residence (region) also plays a big role in insurance costs, as some places have a higher cost of healthcare, resulting in higher bills. Let’s look at the distribution of the regions: we have around the same number of observations per region. Now let’s see how insurance charges differ by region. The scatter plots display a similar pattern in each region. To gain more insight into general insurance cost trends across regions, let’s create a new column that reassigns each charge to one of 10 bins based on the deciles of the training-set charges.

##     charges_new
## 10%    2330.629
## 20%    3989.458
## 30%    5555.493
## 40%    7442.786
## 50%    9373.744
## 60%   11444.155
## 70%   13662.588
## 80%   20189.108
## 90%   34316.836
## [1] 2330.629
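The cutoffs above come from quantile(); the binning step can then be done with cut(). A self-contained sketch on simulated toy charges (not the real data):

```r
# Simulate toy charges, compute decile cutoffs, and assign
# each charge to a percentile bin labeled 1-10
set.seed(1)
toy_charges <- rexp(1000, rate = 1 / 13000)
cuts <- quantile(toy_charges, probs = seq(0.1, 0.9, by = 0.1))
bins <- cut(toy_charges, breaks = c(-Inf, cuts, Inf), labels = 1:10)
table(bins)  # 100 observations per bin by construction
```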

The above visualization is a histogram of how many charges fall within each percentile bin, per region. For example, in the northeast, about 29 insurance charges fall between the 70th and 80th percentiles, meaning those 29 individuals pay more than roughly 70% of the population but are not in the top 20%. The bin labeled 10 counts the charges above the 90th percentile; these are the highest insurance charges. This lets us see that the southwest region has relatively lower insurance charges, as few of its charges fall at or above the 90th percentile. The southeast, on the other hand, has a noticeably higher number of charges above the 90th percentile. However, the southeast also has the highest number of charges below the 10th percentile. Looking at the distribution chart created at the beginning, we see this is partly because the southeast simply has more observations, so higher counts in certain bins are expected. Overall, the counts are fairly even across regions and percentiles. This indicates that region is most likely not an extremely important predictor of health insurance costs.

Just to double check, let’s look at the standard deviation and average insurance charges across all regions.

##   region    average insurance cost standard deviation
## A northeast 13256.5469111923       10883.9694023189  
## B southeast 14540.1674274216       13621.30163839    
## C northwest 12496.1164836015       11205.6792015472  
## D southwest 12247.0496788168       11328.7241409813

Here we can see that the southeast has a noticeably higher average insurance cost. This is supported by the histogram above, which showed that the southeast has the most individuals paying charges above the 90th percentile (the bin labeled 10), meaning they pay more than 90% of the population. The standard deviation, however, is also fairly large for the southeast region. This too is supported by the histogram, where the second-highest bar was the below-the-10th-percentile bin; with charges that spread out, a high standard deviation makes sense. The standard deviations and means for the rest of the regions are relatively similar. This leads to the conclusion that region seems to have no significant effect on insurance costs.

Smoker

Now let’s take a look at how the smoker status of an individual affects his/her medical insurance cost. First, let’s check whether the class is imbalanced by looking at the distribution of the smoker variable. We have a significantly higher number of non-smokers in our training set. Since smoker is a predictor variable, not our target variable, this imbalance is not detrimental to our analysis or model. Let’s see how the percentile bins of charges vary by smoker status, using the same kind of plot as before but split by smoker status instead of region.

Recall that the higher the percentile, the more expensive the medical insurance bill. Here we can see that the majority of the smoking population has an insurance bill above the 90th percentile (the bin labeled 10). Among the non-smoking population, however, the majority fall around the 60th percentile or below. Therefore, it is reasonable to assume that a person’s smoker status affects how much they have to pay yearly in medical insurance.

Sex

When looking at the distribution of charges for males and females, both seem to have the same median of a little below $10,000. However, the male population has a noticeably higher 75th percentile value: 75% of insurance costs for males fall below about $19,000 a year, while 75% of female insurance costs fall below about $15,000 a year. This means that males seem to pay higher insurance rates per year. Let’s see whether these values differ by region. The diagram above shows the distribution of charges per region, separated by sex. We can see that the southeast has significantly higher charges for males than any other region. Females in the northeast, however, have a higher median insurance cost than females in any other region. Another interesting thing to note is that males have a higher average insurance charge in every region except the northwest, where they fall slightly below the females. While these are interesting observations, they do not signify a clear pattern between sex, region, and charges, or the intersection of the three. Let’s take a look at the overall averages of insurance charges for males and females and see whether they differ drastically.

##   sex    average insurance cost standard deviation
## A male   13879.2429471667       12920.1889691683  
## B female 12443.683915434        10671.6975185499

As expected, males have a higher average insurance cost, by a little over $1,000. They also have a slightly higher standard deviation. Overall, I believe sex plays a role in determining how much an individual has to pay for health insurance, but its effect is not as significant as that of age. I am curious to see what a plot of age vs. charges separated by sex looks like.

The overall trends remain the same for each sex, yet males seem to have more points in the $30,000 to $40,000 range. This is expected, as males have a higher average insurance rate. We can also see that both sexes have a similar number of outliers (three for females and two for males). Additionally, the clear difference in charges between smokers and non-smokers is apparent in every region. One unique thing to note is that the median charge for smokers in the northwest is considerably lower than the median charge for smokers in the other regions.

Children

As the last component of our EDA, let’s take a look at how the number of children a beneficiary covers affects his/her insurance bill. The correlation matrix indicated that charges and children are not highly correlated. This was surprising to me, as I thought that with more children to cover, the price of the insurance bill would increase to account for each additional person. Here we can see that the majority of the individuals in our training set are not covering any children. Let’s look at the mean and standard deviation to see just how much the averages differ by number of children.

##   number of children average insurance cost standard deviation
## A zero               12283.7132616376       11675.7584313941  
## B one                12377.2423431559       11295.9968752119  
## C two                15631.040996455        13316.6450266661  
## D three              14696.82651528         11863.0073040667  
## E four               14245.0147609524       9576.92581529077  
## F five               8448.06313214286       3043.13882000087

These values were not what I expected them to be. I expected individuals with more children to have higher yearly insurance costs, as they have more people to cover. It turns out, however, that people with two children pay the highest average insurance cost yearly, while people with five children pay the least. There does not seem to be a clear trend indicating that average insurance cost increases with the number of children. This matches the correlation matrix from the beginning, which showed a very low correlation between charges and children. Let’s see if there is any trend between charges, children, and smoker status. We can see that individuals who smoke have noticeably higher medical insurance charges regardless of the number of children. The ordering of the y axis (from top to bottom) runs from the greatest to the least average charge per number of children. Once again, there is no trend indicating that having more children means paying more for medical insurance. Based on this graph and the correlation matrix created at the beginning of the EDA, we can conclude that the number of children one covers does not seem to directly affect insurance costs.

Model Building

Luckily, the original dataset was already clean, so we do not need to worry about cleaning it. I already split the data into training and testing sets prior to the EDA, so all that remains is to create a recipe to feed our data to the models.

Recipe Building

To build a recipe for our models, I am going to include all 6 original predictors the dataset came with. We use step_dummy() to convert the categorical predictors to dummy variables, and step_normalize() to center and scale all predictors.
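A recipe along those lines might look like this sketch; dummy encoding comes before normalization so the new indicator columns are numeric when they are scaled. The object names are assumptions.

```r
bill_recipe <- recipe(charges ~ ., data = bill_train) %>%
  step_dummy(all_nominal_predictors()) %>%  # sex, smoker, region
  step_normalize(all_predictors())          # center and scale everything
```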

K Fold Cross Validation

Now we are going to use k-fold cross-validation because it gives a more reliable estimate of model performance than a single validation split: every observation is used for both training and validation, which reduces the bias of the estimate, and the variance of the estimate decreases as k increases. We will stratify on the outcome variable, charges, so that each fold covers a similar range of charges.
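The folds can be created with vfold_cv(); the tuning tables later in the report show n = 10, so 10 folds are assumed, and the seed value is an assumption.

```r
# 10-fold cross-validation, stratified on the outcome
set.seed(123)  # seed value is an assumption
bill_folds <- vfold_cv(bill_train, v = 10, strata = charges)
```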

Model Fitting

To fit our recipe to models, we build a workflow, specify the engine, and set the mode to regression. Due to the relatively small number of predictors, I figured I would start with simpler models such as linear regression and lasso regression and eventually work my way up to a regression tree and a random forest. I chose lasso over ridge regression because ridge regression is best used when there is collinearity between predictors, and the correlation matrix from the beginning of the EDA indicates low correlation between predictors, so I ruled ridge out. Lasso regression also performs feature selection, which is useful since I believe some predictors, such as region, will not be particularly helpful. We will assess model performance primarily through the rmse (root mean squared error) and rsq (R squared) metrics, as these are most commonly used to assess regression models.

First let’s start off with linear regression.

Linear Regression
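A sketch of the linear regression workflow, assuming the recipe and training-set names used earlier in this report:

```r
# Plain linear regression via the lm engine
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_workflow <- workflow() %>%
  add_recipe(bill_recipe) %>%
  add_model(lm_spec)

lm_fit <- fit(lm_workflow, data = bill_train)
tidy(lm_fit)                            # coefficient table
predict(lm_fit, new_data = bill_train)  # training-set predictions
```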

## # A tibble: 9 × 5
##   term             estimate std.error statistic   p.value
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)       13168.       182.    72.5   0        
## 2 age                3532.       184.    19.2   5.77e- 71
## 3 bmi                2198.       193.    11.4   1.86e- 28
## 4 children            634.       182.     3.48  5.28e-  4
## 5 sex_male            -20.2      183.    -0.110 9.12e-  1
## 6 region_northwest   -122.       224.    -0.545 5.86e-  1
## 7 region_southeast   -528.       234.    -2.25  2.45e-  2
## 8 region_southwest   -381.       225.    -1.69  9.18e-  2
## 9 smoker_yes         9415.       183.    51.5   8.01e-291
## # A tibble: 6 × 1
##   .pred
##   <dbl>
## 1 6844.
## 2 5693.
## 3 3257.
## 4 4821.
## 5  806.
## 6 1897.
## # A tibble: 6 × 2
##   .pred charges
##   <dbl>   <dbl>
## 1 6844.   4449.
## 2 5693.   3867.
## 3 3257.   2721.
## 4 4821.   1827.
## 5  806.   1837.
## 6 1897.   2395.

## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    5919.   
## 2 rsq     standard       0.751

Because many of the points do not fall on the dotted line, we can conclude that our linear regression model did not fit well. The rmse of 5,919 is also extremely high: rmse is measured in the same units as the outcome, so on average our predictions are off by nearly $6,000 in yearly charges. Let’s see if lasso regression performs better, as it drops any predictors deemed unimportant.
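Since rmse is just the square root of the mean squared prediction error, a minimal base-R check against the first three (.pred, charges) pairs shown above:

```r
# rmse, expressed in the units of the outcome (dollars)
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# first three (charges, .pred) pairs from the table above
rmse_manual(c(4449, 3867, 2721), c(6844, 5693, 3257))  # about 1766
```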

Lasso Linear Regression

## # A tibble: 5 × 5
##   penalty .metric  mean     n std_err
##     <dbl> <chr>   <dbl> <int>   <dbl>
## 1    79.1 rmse    5961.    10    67.5
## 2    59.6 rmse    5961.    10    66.8
## 3    45.0 rmse    5962.    10    66.3
## 4    33.9 rmse    5963.    10    65.9
## 5   105.  rmse    5963.    10    68.4

The above visualization indicates that as the amount of regularization increases, the rmse reaches a peak and then sharply decreases, while the rsq hits a low and starts rapidly increasing. We want a model with low rmse and high rsq, as that indicates a good fit. The best cross-validated rmse we see is about 5,961. This is essentially the same as the linear regression model, and still extremely high, indicating poor model fit.
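A lasso tuning setup along these lines would produce the table above; mixture = 1 makes linear_reg() a pure lasso, and the penalty grid range is an assumption.

```r
# Lasso: linear regression with an L1 penalty, tuned over a grid
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

lasso_wf <- workflow() %>%
  add_recipe(bill_recipe) %>%
  add_model(lasso_spec)

# log10 range of penalties; the exact range is an assumption
lasso_grid <- grid_regular(penalty(range = c(-2, 3)), levels = 50)

lasso_tune <- tune_grid(lasso_wf, resamples = bill_folds, grid = lasso_grid)
show_best(lasso_tune, metric = "rmse")
```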

Regression Tree

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       5198.

The regression tree model produces an rmse of 5,198. This is lower than the lasso model we looked at before, indicating a better fit, yet it is still a poor fit to our data. To see if we can find a better decision tree, we are going to tune the cost_complexity parameter.

We can see that as the cost complexity parameter increases, the rmse increases and the rsq decreases. This is exactly the opposite of what we want, so we can conclude that lower values of the cost complexity parameter produce a better-fitting model. Now we select the model with the best rmse and fit it on the whole training set.
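The tuning and selection step can be sketched as follows; the grid range is an assumption, chosen to cover the cost_complexity values shown in the results.

```r
# Regression tree with a tunable pruning parameter
tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_wf <- workflow() %>%
  add_recipe(bill_recipe) %>%
  add_model(tree_spec)

tree_tune <- tune_grid(
  tree_wf,
  resamples = bill_folds,
  grid = grid_regular(cost_complexity(range = c(-5, -1)), levels = 10)
)
select_best(tree_tune, metric = "rmse")
```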

## # A tibble: 1 × 2
##   cost_complexity .config              
##             <dbl> <chr>                
## 1         0.00464 Preprocessor1_Model07
## # A tibble: 5 × 5
##   cost_complexity .metric  mean     n std_err
##             <dbl> <chr>   <dbl> <int>   <dbl>
## 1       0.00464   rmse    4645.    10    208.
## 2       0.00167   rmse    4677.    10    184.
## 3       0.000599  rmse    4810.    10    180.
## 4       0.00001   rmse    4848.    10    200.
## 5       0.0000774 rmse    4848.    10    196.

The cross-validated rmse of our best-performing regression tree was about 4,645. This is lower than the two models we considered before, so let’s keep it in mind as we fit the bagging model.

Bagging

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4965.

The bagging model (a random forest in which all of the predictors are considered at each split) produces an rmse of 4,965. The scatter plot also shows a pretty good model fit, as the majority of the points land on the line.

Random Forest

Here we can see that as the number of randomly sampled predictors (mtry) increases, the rmse decreases, which is a positive result. Additionally, smaller minimum node sizes yield better rmse values. Let’s see which model parameters produced the best rmse.

## # A tibble: 5 × 6
##    mtry min_n .metric  mean     n std_err
##   <int> <int> <chr>   <dbl> <int>   <dbl>
## 1     6    50 rmse    4568.    10    176.
## 2     6    61 rmse    4575.    10    178.
## 3     6    55 rmse    4576.    10    177.
## 4     6    66 rmse    4593.    10    181.
## 5     5    50 rmse    4598.    10    177.

Here we can see that the model with the lowest average rmse, about 4,568, uses all 6 predictors (mtry = 6) and min_n = 50. This is still a high rmse, but it is among the lowest of the models we have tested so far. Let’s see if a KNN model fits better.
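The random forest specification behind these results might look like the following sketch; the ranger engine, the number of trees, and the grid ranges are assumptions chosen to match the mtry and min_n values in the table above.

```r
# Random forest with tunable mtry and min_n
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%  # importance for the later VIP plot
  set_mode("regression")

rf_wf <- workflow() %>%
  add_recipe(bill_recipe) %>%
  add_model(rf_spec)

rf_tune <- tune_grid(
  rf_wf,
  resamples = bill_folds,
  grid = grid_regular(mtry(range = c(1, 6)), min_n(range = c(40, 70)), levels = 6)
)
show_best(rf_tune, metric = "rmse")
```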

KNN

Here we can see that as the number of nearest neighbors increases, the rmse decreases, which is what we want. The lowest rmse we get is about 5,500, when the number of nearest neighbors is 5. Let’s see how well the boosted trees model works.

Boosted Trees

## # A tibble: 1 × 4
##    mtry min_n learn_rate .config               
##   <int> <int>      <dbl> <chr>                 
## 1     6    50        0.1 Preprocessor1_Model211
## # A tibble: 5 × 7
##    mtry min_n learn_rate .metric  mean     n std_err
##   <int> <int>      <dbl> <chr>   <dbl> <int>   <dbl>
## 1     6    50        0.1 rmse    6078.    10    196.
## 2     6    60        0.1 rmse    6105.    10    202.
## 3     5    50        0.1 rmse    6188.    10    200.
## 4     6    70        0.1 rmse    6252.    10    208.
## 5     5    60        0.1 rmse    6309.    10    202.

Based on the parameters we chose to tune, we can see that the model using all 6 predictors, min_n = 50, and a learning rate of 0.1 gives the lowest rmse, about 6,078.

Fitting the Final Model

Now that we have fit all our proposed models to the training set and found the best rmse per model, we can see that the random forest produced the lowest cross-validated rmse, about 4,568. Let’s fit this model to our testing set and observe its performance.
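The finalization step can be sketched as below; the object names rf_wf and rf_tune are assumptions standing in for the tuned random forest workflow and its tuning results, and last_fit() refits on the full training set before scoring the held-out test set once.

```r
# Pick the best parameters, refit on the full training set,
# and evaluate once on the held-out test set
best_rf   <- select_best(rf_tune, metric = "rmse")  # rf_tune: tuning results (name is an assumption)
final_wf  <- finalize_workflow(rf_wf, best_rf)      # rf_wf: random forest workflow (assumption)
final_fit <- last_fit(final_wf, bill_split)
collect_metrics(final_fit)  # test-set rmse and rsq
```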

The final scatter plot of predictions against the testing set looks relatively good. The higher the charges get, however, the poorer our predictions become; most of the points fall on the line when charges are below $20,000. Let’s see which predictors this random forest deemed most important. Smoker status (particularly smoker = yes) had an overwhelmingly important effect on determining insurance charges. Age, BMI, and children also played important parts. The southwest region, if anything, had a slightly negative variable importance. This aligns with the results of the correlation matrix we produced at the beginning of the EDA.

Conclusion

Overall, the final random forest model did a good job of predicting charges below $20,000 on the testing set. The model still produced a very high RMSE of over $4,000, although every other model produced an even higher value. This high RMSE suggests that our model cannot fully account for the important features underlying our data. The regression tree, bagging, random forest, and k-nearest-neighbors models performed best among all our models. Lasso regression, linear regression, and boosted trees performed poorly, with the highest rmse values. I was not surprised that linear regression and lasso regression did not work well, because the EDA revealed no obvious linear pattern between charges and any of the other predictors.

My next steps to improve this model would be to obtain additional predictor variables that help predict charges better. Since our model fails to account for important underlying features, I am hoping that a wider variety of predictors would produce a better-equipped model with a wider range of information for determining insurance costs. Another step I would look into is creating interaction terms between certain predictors. There seemed to be little correlation among my current predictors, but with the addition of more predictors, I hope to find some variables that are correlated and create interaction terms between them.